Skip to content

fix: null-guard responseObject derefs across NSO ActivityPolicies (DLQ leak)#230

Open
ecv wants to merge 3 commits into
mainfrom
fix/gateway-activitypolicy-create-summary-dlq
Open

fix: null-guard responseObject derefs across NSO ActivityPolicies (DLQ leak)#230
ecv wants to merge 3 commits into
mainfrom
fix/gateway-activitypolicy-create-summary-dlq

Conversation

@ecv

@ecv ecv commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

What

Make auditRules[].summary CEL null-safe across all NSO ActivityPolicies that
dereference audit.responseObject.<leaf> in a create/update rule. Guard the leaf
and fall back to audit.objectRef.name.

Why

Prod DLQSlowLeak on gateway.networking.k8s.io-gateway
(error_type=cel_summary, ~120 Gateway audit events/6h, climbing):

rule 0 summary: summary template evaluation failed:
eval "link(audit.responseObject.metadata.name, audit.objectRef)": no such key: name

The create/update rules match on has(audit.requestObject.spec), so a
rejected request (409/422/admission-deny) still matches — but the apiserver
returns a Status object as responseObject, with no spec/metadata.name.
CEL raises no such key, the event goes to the DLQ, and retries fail
identically. Slow steady leak.

An audit of every NSO ActivityPolicy found the same latent bug in 5 more files.

Fixed rules

  • gatewaycreate (the prod leak)
  • backendtlspolicy, connector, connectoradvertisement, trafficprotectioncreate (responseObject.metadata.name)
  • domaincreate + update (responseObject.spec.domainName)

delete rules already guard has(audit.responseObject.spec) — unchanged.
httproute/httpproxy delete use responseObject.spec.hostnames[0] but match
guards has(...hostnames) && size(...) > 0 — safe, unchanged.

Prod remediation

Merge + NSO release → Flux re-applies the corrected ActivityPolicy CRs
(overwriting the live objects); processor immediately retries the DLQ backlog.
No kubectl.

Related (same bug class, separate repos)

The create rule's summary dereferenced audit.responseObject.metadata.name, but
a rejected create (e.g. 409/422) returns a Status object with no metadata.name.
The rule still matched (it only checks requestObject.spec), so CEL evaluation
raised "no such key: name", the event went to the DLQ, and retries failed
identically -- a slow, steady DLQ leak (DLQSlowLeak in prod).

Guard the leaf path and fall back to objectRef.name, preserving the
server-assigned-name display for successful creates while staying null-safe
for rejected ones.

Fixes the prod DLQSlowLeak on policy gateway.networking.k8s.io-gateway
(error_type=cel_summary).
…olicies

Same DLQ-leak class as the gateway create rule: create/update summaries
dereference audit.responseObject.{metadata.name,spec.domainName}, but a
rejected create/update returns a Status object with no spec/metadata.name, so
CEL raises "no such key" and the event leaks to the DLQ.

Guard the leaf and fall back to audit.objectRef.name (always present) across:
- backendtlspolicy, connector, connectoradvertisement, trafficprotection (create)
- domain (create + update)

delete rules already guard has(audit.responseObject.spec) and are unchanged.
@ecv ecv changed the title fix: stop gateway ActivityPolicy DLQ leak on rejected creates fix: null-guard responseObject derefs across NSO ActivityPolicies (DLQ leak) Jun 27, 2026
@scotwells

Copy link
Copy Markdown
Contributor

Interesting, does the log capture what the full audit log payload is? I'm curious why the metadata name isn't available in the response object of the create operation.

@ecv

ecv commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Successful creates have it, only failures don't.

Quoth the raven,

Our create rule matches on has(audit.requestObject.spec) (the request, always valid), so a reject still matches the rule, then the summary derefs responseObject.metadata.name on a Status object → no such key: name → DLQ.

The line in the description is the processor's CEL eval error, not the raw audit payload. The full responseObject is only present in the audit event if the apiserver audit level is RequestResponse (at Metadata/Request it isn't logged at all).

But audit.objectRef.name, which is populated on both success and reject, so. Safer.

@ecv

ecv commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Yes — the processor logs the full audit event. activityprocessor/processor.go:945 ("Failed to evaluate policy") emits the whole thing as eventJSON. Pulled a live one from staging (a networking.datumapis.com-connector create currently re-failing the DLQ retry):

{
  "auditID": "5c93f394-87b7-4108-912a-920facc12774",
  "level": "RequestResponse",
  "verb": "create",
  "objectRef": {
    "apiGroup": "networking.datumapis.com",
    "apiVersion": "v1alpha1",
    "namespace": "default",
    "resource": "connectors"
  },
  "requestObject": {
    "kind": "Connector",
    "metadata": { "generateName": "datum-connect-" },
    "spec": { "connectorClassName": "datum-connect" }
  },
  "responseObject": {
    "apiVersion": "v1",
    "kind": "Status",
    "code": 403,
    "status": "Failure",
    "reason": "Forbidden",
    "message": "connectors.networking.datumapis.com \"datum-connect-2qjq6\" is forbidden: Your request took too long to be checked against your quota. Please try again in a moment — if this keeps happening, contact support.",
    "metadata": {},
    "details": { "group": "networking.datumapis.com", "kind": "connectors", "name": "datum-connect-2qjq6" }
  },
  "responseStatus": { "code": 403, "status": "Failure", "reason": "Forbidden", "details": { "name": "datum-connect-2qjq6" } }
}

This confirms the root cause concretely: the create was rejected (403, quota check timed out), so responseObject is a metav1.Status with metadata: {} — no metadata.name. Hence no such key: name.

But this payload also shows the current fix is incomplete for connector. Note objectRef has no name:

"objectRef": { "apiGroup": "...", "apiVersion": "v1alpha1", "namespace": "default", "resource": "connectors" }

The client used generateName: datum-connect-, and the request was rejected at the quota check before a name was assigned, so objectRef.name is absent too. Our fallback —

has(audit.responseObject.metadata.name)
  ? link(audit.responseObject.metadata.name, audit.objectRef)
  : link(audit.objectRef.name, audit.objectRef)

— takes the else branch and derefs audit.objectRef.name, which raises no such key: name again. The processor logs confirm it: networking.datumapis.com-connector events are still re-failing the DLQ retry (retryCount climbing 0→10) on the fixed branch.

The name is present in the rejected event — at responseObject.details.name / responseStatus.details.name (datum-connect-2qjq6). So the fallback chain needs to go further for generateName resources, e.g.:

has(audit.responseObject.metadata.name) ? link(audit.responseObject.metadata.name, audit.objectRef)
  : has(audit.objectRef.name) ? link(audit.objectRef.name, audit.objectRef)
  : has(audit.responseObject.details.name) ? link(audit.responseObject.details.name, audit.objectRef)
  : link('a connector', audit.objectRef)

Gateway likely uses explicit names (so objectRef.name is populated and the current fix holds there), but any policy whose clients create via generateName — connector at minimum — needs the deeper fallback. I'll audit which of the 6 are generateName-based.

… creates

The earlier fix guarded responseObject.<leaf> and fell back to
audit.objectRef.name, but a live audit event from staging exposed two gaps:

1. generateName creates (connector, connectoradvertisement) rejected before a
   name is assigned have an EMPTY audit.objectRef.name. The fallback then
   derefs audit.objectRef.name on a Status responseObject and raises
   "no such key: name" again, so the event still dead-letters. The name is
   carried on the Status at responseObject.details.name.

2. domain create/update used has(audit.responseObject.spec.domainName), but
   nested has(a.b.c) evaluates the intermediate selection and errors when a
   rejected Status has no spec. This is the same DLQ bug class for every
   rejected domain create/update.

Replace each create/update summary with a per-level-guarded fallback chain:
  responseObject.<leaf> -> objectRef.name -> responseObject.details.name -> literal
The details branch is guarded with has(responseObject.details) because some
Status responses carry no details. domain also guards has(responseObject.spec).

Evidence: activity-processor logs (processor.go:945) on staging show
networking.datumapis.com-connector events re-failing the DLQ retry
(retryCount climbing) on the previously "fixed" objectRef.name branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ecv added a commit to milo-os/milo that referenced this pull request Jun 30, 2026
#672)

## What

Make the `create` rule summaries null-safe in six ActivityPolicies that
dereference `audit.responseObject.metadata.name`. Guard the leaf and
fall back
to `audit.objectRef.name`.

## Why

Same DLQ-leak class as the prod `DLQSlowLeak` on the NSO gateway policy
(milo-os/activity#212). A `create` rule matches on verb/requestObject,
so a
**rejected** create (409/422/admission-deny) still matches — but the
apiserver
returns a `Status` object as `responseObject` with no `metadata.name`.
CEL
raises `no such key: name`, the event goes to the DLQ, and retries fail
identically. Latent until the first rejected create of each kind.

## Fixed (create rules)

- `config/services/activity/policies/resourcemanager/` — project,
organization
- `config/services/activity/policies/iam/` — role, group, serviceaccount
- `config/services/identity/policies/` — serviceaccount

## Remediation

Merge + milo release → the `milo-activity-policies` Flux Kustomization
re-applies the corrected CRs; processor retries any DLQ backlog. No
`kubectl`.

## Related

- NSO fix (gateway + 5 more): datum-cloud/network-services-operator#230
- billing fix: milo-os/billing (separate PR)
- Runbook fix: milo-os/activity#213 · Investigation:
milo-os/activity#212
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants